Cluster based Mixed Coding Schemes for Inverted File Index Compression

Authors

Jinlin Chen

Ping Zhong

Terry Cook

Abstract

One way to improve inverted file compression is to exploit the cluster property [1] of document collections, which states that term occurrences are not uniformly distributed: some terms are used more frequently in some parts of the collection than in others. The corresponding parts of the inverted list consequently consist of clusters of small d-gap values. Interpolative code [9] exploits the cluster property of term occurrences and achieves very good performance. Other codes that favor small d-gaps also perform well on document collections with the cluster property.
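As a rough illustration of why clustered term occurrences help, the sketch below converts two posting lists into d-gaps and counts the bits a small-gap-favoring code would spend on them. Elias gamma is used here only as a convenient stand-in for "codes that favor small d-gaps"; it is not the scheme proposed in the paper, and the function names, the example posting lists, and the assumption that document IDs start at 1 are all hypothetical.

```python
from typing import List


def d_gaps(postings: List[int]) -> List[int]:
    """Convert a strictly increasing list of document IDs into d-gaps.

    Assumes document IDs start at 1, so every gap is a positive integer.
    """
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def elias_gamma_bits(n: int) -> int:
    """Number of bits the Elias gamma code uses for a positive integer n."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    # floor(log2 n) unary prefix bits, 1 stop bit, floor(log2 n) binary bits.
    return 2 * (n.bit_length() - 1) + 1


def total_bits(postings: List[int]) -> int:
    """Total Elias gamma cost of a posting list stored as d-gaps."""
    return sum(elias_gamma_bits(gap) for gap in d_gaps(postings))


# Illustrative data: eight postings each, both starting at document 100.
clustered = [100, 101, 102, 103, 104, 105, 106, 107]  # term concentrated in one region
spread = [100, 228, 356, 484, 612, 740, 868, 996]     # term spread evenly

print(d_gaps(clustered))      # [100, 1, 1, 1, 1, 1, 1, 1]
print(total_bits(clustered))  # 20
print(total_bits(spread))     # 118
```

With the same number of postings, the clustered list costs far fewer bits because most of its gaps are 1, which is the effect the cluster property describes.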

Similar resources

A New Compression Based Index Structure for Efficient Information Retrieval

Finding desired information in a large data set is a difficult problem. Information retrieval is concerned with the structure, analysis, organization, storage, searching, and retrieval of information. The index is the main constituent of an IR system. Nowadays, the exponential growth of information makes the index structure large enough to affect the IR system’s quality. So compressing the Index struct...

Re-Ordered FEGC and Block Based FEGC for Inverted File Compression

Data compression has been widely used in many information retrieval applications, such as web search engines and digital libraries, to make data retrieval faster. In these applications, universal codes (Elias codes (EC), Fibonacci code (FC), Rice code (RC), Extended Golomb code (EGC), Fast Extended Golomb code (FEGC) etc.) have been preferred over statistical codes (Huf...

Re-Pair Compression of Inverted Lists

Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompressi...

On the Impact of Random Index-Partitioning on Index Compression

The performance of processing search queries depends heavily on the stored index size. Accordingly, considerable research efforts have been devoted to the development of efficient compression techniques for inverted indexes. Roughly, index compression relies on two factors: the ordering of the indexed documents, which strives to position similar documents in proximity, and the encoding of the i...

Optimize Document Identifier Assignment for Inverted Index Compression

Document identifier assignment is a technique for inverted file index compression that reduces the d-gap values of posting lists. Existing studies have approached it with either TSP or clustering methods. However, there is no proper formulation of this problem, and the existing approaches have no theoretical guarantee of being good approximations. In this paper, we first formulate document identifier assignmen...

Journal:

JDIM

Volume 6

Publication date: 2008
